The Link

In a web page, there may be numerous links to other pages. A web page is normally an HTML file with a .htm or .html extension. It is a plain text file that web browsers interpret according to the HTML syntax. The syntax for a link tag is as follows:

<a href = "other_page.html"> some_text </a>

The keywords (or key characters), shown in red in the original page, are not case sensitive. Spaces before and after the "=" key character are optional.
other_page.html is the name of the page linked to.
some_text is an arbitrary text string.
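
As an illustration of the syntax above, the following Python sketch pulls the link targets out of a page's text with a regular expression. It assumes that href is the first attribute of the tag and that its value is enclosed in double quotes, as in the syntax shown. The names LINK_RE and extract_links are hypothetical and not part of the assignment.

```python
import re

# Matches the opening of a link tag and captures the quoted page name.
# Keywords are matched case-insensitively; spaces around "=" are optional.
LINK_RE = re.compile(r'<a\s+href\s*=\s*"([^"]*)"', re.IGNORECASE)


def extract_links(html):
    """Return the linked page names found in html, in document order."""
    # strip() drops stray whitespace inside the quotes.
    return [target.strip() for target in LINK_RE.findall(html)]


if __name__ == "__main__":
    sample = '<p>Menu</p> <A HREF="cse.html">CSE</A> <a href = "eee.html"> EEE </a>'
    print(extract_links(sample))
```

Everything that is not a link tag, such as the <p> element in the sample, simply fails to match and is ignored.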

The "linked to" pages may have links to other pages and may even link to the previous page. Let us see an example:

A page index.html has links to
    1. department.html
    2. students.html
    3. faculty.html

department.html has links to
    1. cse.html
    2. eee.html
    3. me.html
    4. civil.html
    5. index.html

students.html has links to
    1. organizations.html
    2. facilities.html
    3. groups.html

cse.html has links to
    1. intro.html
    2. location.html
    3. alumni.html
    4. index.html

Tree view:

    index   (level 0 page)
    |
    |_______department (level 1 page)
    |                     |_______cse (level 2 page)
    |                     |               |_______intro (level 3 page)
    |                     |               |_______location
    |                     |               |_______alumni
    |                     |               |_______index
    |                     |
    |                     |_______eee
    |                     |_______me
    |                     |_______civil
    |                     |_______index
    |
    |_______students
    |                     |_______organizations
    |                     |_______facilities
    |                     |_______groups
    |
    |_______faculty

Your job is to parse HTML documents and retrieve the links from them. If the above HTML files are given as input, your output should be:

Level 0 page: index.html has links to
    1. department.html
    2. students.html
    3. faculty.html

Level 1 page: department.html has links to
    1. cse.html
    2. eee.html
    3. me.html
    4. civil.html
    5. index.html

Level 1 page: students.html has links to
    1. organizations.html
    2. facilities.html
    3. groups.html
 

Level 2 page: cse.html has links to
    1. intro.html
    2. location.html
    3. alumni.html
    4. index.html

Input:
            a. Name of the level 0 page
            b. Maximum number of levels to follow
Output:
            As indicated above
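
One possible way to produce a block in this layout is sketched below in Python. The function name format_page and its parameters are assumptions made for illustration only.

```python
def format_page(level, name, links):
    """Build the report block for one page in the layout shown above."""
    lines = [f"Level {level} page: {name} has links to"]
    # Links are numbered from 1 and indented by four spaces.
    for number, link in enumerate(links, start=1):
        lines.append(f"    {number}. {link}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_page(1, "students.html",
                      ["organizations.html", "facilities.html", "groups.html"]))
```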

Note: There is no limit to the total number of links.

Assumptions:
            a. All pages will be placed in the same directory/folder.
            b. Page names are case insensitive.
            c. There may be tags in the HTML file other than the link tag. You should
                 ignore those tags. In fact, you should ignore anything other than
                 the link tag.
            d. Links pointing to a lower-level page (one closer to the level 0 page,
                 such as the link from cse.html back to index.html) should not be
                 followed; otherwise your program will fall into an infinite loop.

Hints:
           Use a queue to hold the names of the pages. Each entry in the queue may contain the following information:
   1. Name of the page
   2. Level of the page
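
The queue hint can be sketched in Python as a breadth-first walk, shown below. Several things here are assumptions rather than requirements of the assignment: pages up to and including max_level are reported, pages that are missing or contain no links are skipped silently, and a set of already-seen names is used to avoid loops (this is stricter than the rule about lower-level pages, since it also skips pages already queued at the same level). The names crawl, read_page and LINK_RE are hypothetical.

```python
import os
import re
from collections import deque

# Assumes href is the first attribute and is double-quoted.
LINK_RE = re.compile(r'<a\s+href\s*=\s*"([^"]*)"', re.IGNORECASE)


def read_page(name, directory="."):
    """Return the text of a page, or None if the file does not exist."""
    path = os.path.join(directory, name)
    if not os.path.isfile(path):
        return None
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def crawl(start_page, max_level, read=read_page):
    """Breadth-first walk; returns a list of (level, page, links) tuples."""
    report = []
    seen = {start_page.lower()}        # page names are case insensitive
    queue = deque([(start_page, 0)])   # each entry: (name of page, level of page)
    while queue:
        name, level = queue.popleft()
        html = read(name)
        if html is None:
            continue                   # assumption: missing pages are skipped
        links = [target.strip() for target in LINK_RE.findall(html)]
        if not links:
            continue                   # assumption: pages without links are not reported
        report.append((level, name, links))
        if level < max_level:
            for link in links:
                key = link.lower()
                if key not in seen:    # never revisit a page already queued
                    seen.add(key)
                    queue.append((link, level + 1))
    return report


if __name__ == "__main__":
    for level, name, links in crawl("index.html", 2):
        print(f"Level {level} page: {name} has links to")
        for number, link in enumerate(links, start=1):
            print(f"    {number}. {link}")
        print()
```

Because the queue is first in, first out, all level 1 pages are processed before any level 2 page, which gives the output order shown earlier. Passing the reader as a parameter is a design choice that lets the walk be tried on in-memory pages, for example crawl("index.html", 2, pages.get) with pages being a dictionary of page name to HTML text.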

Queue Implementation = 5
Parsing of link tags + file operation = 5
Showing output in desired order = 5
Avoidance of infinite recursion = 5
Overall program efficiency = 5

Total marks = 25
